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Abstract 



As an alternative to variable selection or shrinkage in high dimensional regression, 
we propose to randomly compress the predictors prior to analysis. This dramatically 
reduces storage and computational bottlenecks, performing well when the predictors 
can be projected to a low dimensional linear subspace with minimal loss of information 
about the response. As opposed to existing Bayesian dimensionality reduction ap- 
proaches, the exact posterior distribution conditional on the compressed data is avail- 
able analytically, speeding up computation by many orders of magnitude while also 
bypassing robustness issues due to convergence and mixing problems with MCMC. 
Model averaging is used to reduce sensitivity to the random projection matrix, while 
accommodating uncertainty in the subspace dimension. Strong theoretical support is 
provided for the approach by showing near parametric convergence rates for the pre- 
dictive density in the large p small n asymptotic paradigm. Practical performance 
relative to competitors is illustrated in simulations and real data applications. 

Key Words: Compressed sensing; Data compression; Dimensionality reduction; Large p, 
small n; Random projection; Sparsity; Sufficient dimension reduction. 

1 Introduction 

With recent technological progress, it is now routine in many disciplines to collect data 
containing massive numbers of predictors, ranging from thousands to millions or more. In 
such settings, it is commonly of interest to consider regression models such as 



where X is an n x p matrix of predictors, p ^> n, y is an n x 1 response vector and 
e ~ N n (0, o~ 2 I n ) is a residual vector. As traditional techniques such as maximum likelihood 
cannot be used, a rich variety of alternatives have been proposed ranging from penalized op- 
timization methods, such as Lasso (Tibshirani, 1996) and elastic net (Zhou et al., 2004), to 
Bayesian variable selection or shrinkage methods, such as Bayesian Lasso (Park et al., 2008; 
Hans, 2009), horseshoe (Carvalho et al., 2009, 2010) and generalized double Pareto (Armagan 
et al., 2012). The Bayesian approach provides a probabilistic characterization of uncertainty 
in the high-dimensional regression coefficients and in the resulting predictions, while penal- 
ization methods tend to focus on point estimation. There is a recent literature showing 



y = Xj + e, 




optimality properties of Bayesian variable selection and shrinkage in high-dimensional set- 
tings, allowing the number of candidate predictors to increase exponentially with the sample 
size while imposing sparsity constraints (Jiang, 2007; Castillo and van der Vaart, 2012; 
Armagan et al., 2012; Strawn et al., 2012; Bhattacharya et al., 2013). 

This literature focuses on properties of the exact posterior, and computability problems 
are encountered in practical applications involving many predictors. For example, it is well 
known that computing the posterior under Bayesian variable selection priors is an intractable 
NP-hard problem, so that one can at best hope for a rough approximation using Markov 
chain Monte Carlo (MCMC) sampling unless p is small. A commonly used approach is to 
approximate the posterior with a computationally tractable distribution. This gives rise to 
variational Bayes approximations, which are popular in other disciplines (Girolami et al., 
2006; Titsias et al., 2009) and have recently started to infiltrate the statistical literature 
(Faes et al., 2011; Ormerod et al., 2012). One notable disadvantage of variational Bayes is 
that, except in simple cases such as exponential family models, there is a lack of theoretical 
justification in terms of accuracy of the approximation or other performance metrics. 

We propose a new approach for high-dimensional regression problems based on random 
projections of the scaled predictor vector prior to analysis. This Bayesian compressed re- 
gression (BCR) method solves several problems simultaneously. There is no computational 
bottleneck, and scaling to enormous p is trivial. Our approach simply calculates a conju- 
gate Gaussian-inverse gamma posterior for the regression coefficients corresponding to the 
compressed predictors in parallel for different random projections having different subspace 
dimensions. Model averaging (Raftery et al. 1997) is then employed to yield a single poste- 
rior predictive distribution. Notably, there is no MCMC sampling and one can obtain the 
predictive distribution extremely rapidly even in problems with huge numbers of predictors. 

An important issue is the question of theoretical justification. Jiang (2007) showed that 
carefully tailored Bayesian variable selection priors lead to near parametric rates in estimat- 
ing the predictive distribution f(y\x) in settings involving massive dimensional predictors 
under near sparsity constraints. This is an impressive theoretical result, which is consistent 
with the excellent performance observed for Bayesian variable selection in practice. However, 
Jiang's result is for the true posterior distribution, which, as mentioned above, cannot be 
computed accurately in large p problems. We show that our simple Bayesian compressed 
regression procedure enjoys similar theoretical guarantees, but importantly these guarantees 
are for a computable method. In addition, the compressed regression method is expect to 
have excellent performance not only under variable selection sparsity but also in the sufficient 
dimensionality reduction setting in which predictors can be projected to a low dimensional 
linear subspace with minimal loss of information about the response. 

Bayesian compressed regression is inspired by the data squashing literature and recent 
dramatic success of compressive sensing. Data squashing compresses a large number of 
sample points to a smaller representative set, while attempting to yield similar results to 
an analysis of the full data set. One of the early articles in the data squashing literature 
is Dumouchel et al. (1999) where the authors suggest constructing a smaller set of pseudo 
data having matched moments to the original mother data. Madigan et al. (2002) instead 
proposed a model-based approach to achieve likelihood-based clustering of a large data set 
into a smaller number of data points with associated weights. Owen (2003) used empirical 
likelihood-based data squashing, while Lee et al. (2008) proposes a representative sampling 
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approach. The idea of compressive sensing instead focuses on randomly compressing the data 
to facilitate storage and analysis, while retaining the ability to reconstruct the compressed 
signals with high accuracy under sparsity conditions (Donoho, 2006; Candes et a!., 2006). 

Our approach is related to compressive sensing in relying on random projections, but 
there are fundamental differences. Notably, in compressive sensing, one is not interested in 
settings involving predictors and a response but instead wants to compress a huge number 
n of samples of a signal into fewer samples m. Our approach is somewhat orthogonal to 
this in leaving the number of samples n unchanged but instead compressing the number 
of predictors to a smaller dimension. Zhou, Lafferty and Wasserman (2009) proposed a 
compressed regression approach to maintain privacy by replacing ([I]) with 

$j/ = $X7 + $€,e~iV(0,(j 2 /). (2) 

Using an LI regularization approach to obtain a sparse point estimate, they showed oracle 
properties. Clearly, this approach also compresses sample size instead of predictors. 

Section 2 proposes the Bayesian compressed regression (BCR) approach, while providing 
background. Section 3 provides theoretical results on convergence rates. Section 4 contains 
a simulation study. Section 5 contains real data applications, and Section 6 concludes with 
a discussion. All proofs are included in an Appendix. 



2 Proposed Approach 

2.1 Compressing predictors through random projection 

For subjects i = 1, . . . , n, let yi e y denote a response and x-i = (xa, . . . , Xi P )' e^cff 
denote predictors. We consider compressed regression models having the general form 

Vl ~ f{(<S> Xl )'(3,a}, (3) 

where f(n, a) is a family of distributions having location fi and scale a, $ is an m x p pro- 
jection matrix, and f3 = (/?!,..., (3 m )' are coefficients on the compressed predictors. Unlike 
other dimensionality reduction methods, which rely on projecting high-dimensional predic- 
tors to a lower-dimensional subspace, we do not attempt to estimate $ based on the data, as 
this conveys a daunting computational price. Instead, we draw the elements indepen- 
dently, setting $jj = —\J^{, with probability ip 2 , with probability 2(1 — ip)ip and with 

probability (1 — ip) 2 , with the rows of $ then normalized using Gram-Schmidt orthonor- 
malization. This method of sampling $ is popular in compressed sensing (Dasgupta, 2003, 
2013). In our approach, we estimate ip, which is restricted to (0.1, 1) for numerical stability. 

In implementations of Bayesian compressed regression (BCR), we focus on Gaussian 
linear models, replacing ([!]) with 

y i = ($x i yp + e i , e,~iV(0,(T 2 ), (4) 

where regression coefficients (3 = (/3i, . . . , /3 m )' have a much lower dimension and different 
interpretation than the regression coefficients in ([T]). Since Q is a normal linear regression 
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model with m < n, we are no longer in the high- dimensional setting and can choose usual 
conjugate priors for (f3,cx). In particular, we choose a normal-inverse gamma (NIG) prior, 



09 | a 2 ) ~ iV(0, a 2 S,g), a 2 ~ JG(a, 6), 

leading to an NIG posterior distribution for (f3, a) given y and <&x. In the special case in 
which a, b — > 0, we obtain Jeffrey's prior and the posterior distribution is 

/3|t/~t n (/x,S), (5) 
(7 2 |t/~IG(aiA), (6) 

where ai = n/2, b x = {y'y - y'X<f>' [<$>X'X<$>' + E^ 1 ]^ *X'y} /2, 

= + S^ 1 ] _1 M'y, £ = (2&i/n) + S^ 1 ] _1 , and S) denotes a 

multivariate-i: distribution with z/ degrees of freedom, mean \x and covariance S. 

Hence, the exact posterior distribution of {(3, a 2 ) conditionally on <& is available analyt- 
ically. In practice we are not interested in inferences on (3 directly, but instead would like 
to do predictions or inferences on 7. The predictive of y n +i given x n+ \ and $ for a new 
(n + l)st subject marginalizing out (/3, a 2 ) over their posterior distribution is 

y n +l\x n+ l, y ~ t n [llpred, &lred) > ( 7 ) 



where /v«2 = ($cc n+ i)V, oj.^ = (2&i/n) 1 + (fcajn+i)' (Eg 1 + &X'X&) 1 $av 
To provide a heuristic motivation for BCR, we rewrite Q as 



+1 



2\ 



2/i = x'i&fi + ei = + e i5 ~ iV(0, a 

Assigning a joint NIG prior on (f3,a 2 ) induces a prior on (7, a 2 ) conditionally on the ran- 
domly generated <&, which is kept fixed at its initial generated value and not updated dur- 
ing data analysis. The prior on 7 is clearly a singular distribution that resides on an Tri- 
dimensional hyperplane embedded in TZ P . Typical Bayesian approaches for high-dimensional 
linear regression would instead directly define a prior for 7 that concentrates on a low- 
dimensional linear subspace; BCR accomplishes this indirectly through random projections. 
By avoiding updating of $ and only learning the posterior for /3, we obtain enormous com- 
putational savings while maintaining superior predictive performance in our experience and 
justified theoretically in Section 3. 

On the surface, Q seems reminiscent of sufficient dimensionality reduction (SDR, Cook, 
1998), which attempts to find the smallest subspace S having basis A e lZ pxd , d <C p, sat- 
isfying f(y I X) = f(y I AX). SDR-based approaches simultaneously estimate A and the 
density f(y | A'X) relying on a rich variety of strategies, all of which face severe computa- 
tional bottlenecks in estimating A for even moderately large data sets. For example, the 
elegant Bayesian approach of Ghosh et al. (2010) cannot cope with more than a few dozens 
of predictors and a few dimensional subspace. Instead we free this bottleneck by randomly 
generating <& in advance of the data analysis, and then rely on model averaging to reduce 
sensitivity. 
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2.2 Model averaging 

The approach described in the previous section can be used to obtain a posterior distribution 
for 7 and a predictive distribution for y n+ \ given x n+ i for a new (n + l)st subject condi- 
tionally on the m x p random projection matrix We would like to limit sensitivity of the 
results to the specified m and randomly generated This is accomplished by generating s 
random projection matrices having different (to, ip) values, and then using model averaging 
to combine the results. We let Mi, I = 1, . . . , s, represent ^ with <3? having mi rows and 
ipi ~ {7(0.1, 1). Corresponding to model Mi, we denote <E», f3 and cr 2 by & l \ j3"' and cr 2 ^ 
respectively. Let M = {Mi, ■ ■ • , M s } denote the set of models corresponding to different 
random projections, V = {(y i} Xi),i = 1, . . . , n) denote the observed data, and (y n +i, x n+ i) 
denote the data for a future subject. The predictive density of y n +\ given x n+1 is 

s 

f(y n+1 \x n+1 ,V) = J2f(yn+i\*n+i,Mi,V)P(Mi\V), (8) 

i=i 

where the predictive density of y n +i given x n+ i under projection Aii is given in ^ and the 
posterior probability weight on projection Aii is 



J2UP(V\M h )P(M b y 
Assuming equal prior weights P(Aii) = 1/s. The marginal likelihood under Aii is 

P(V | Aii) = j Pip | Mi, (3 {l) , a 2W )7r(/3 (0 > o- 2{l) )d(3 {l) da 2{l) . (9) 

After a little algebra, one observes that for (jij) with (/3 | a 2 ) ~ 7V(0, cr 2 ^^), 7r(o- 2 ) oc ^ , 

1 2?r(?) 



i n 

2 



Plugging in the above expressions in (|8j), one obtains the posterior predictive distribution as 
a weighted average of t densities. Invoking the Sherman- Woodbury-Morrison matrix identity 
and matrix determinant lemma, one obtains 



(X$'S /3 $X' + 7) 1 = i" - X& (S^ 1 + <$>X'X&) 
IX^'E^X' + I\ = IS^ 1 + QX'X&WEpl. 

With the help of the above identity, different components in ^ can be estimated in parallel 
with the main computational expense being 0(mf) matrix inversions under the Ith random 
projection. As m; < p such inversion can be obtained quickly. On a cluster, one can easily 
average across a massive number s of possible $s. However, we have found that there are 
diminishing gains after a modest number (e.g., s ~ 100) and hence BCR can be implemented 
very rapidly even using a batch implementation in R or Matlab. 
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An important question that remains is how much information is lost in compressing the 
high-dimensional predictor vector to a much lower dimension? In particular, one would 
expect to pay a price for the huge computational gains in terms of predictive performance 
or other metrics. We address this question in two ways. First we consider the theoretical 
performance in prediction in a large p, small n asymptotic paradigm in Section 3. Then, we 
will consider practical performance in finite samples using simulated and real data sets. 

3 Convergence Rates of Predictive Densities 

In this section we study the convergence properties of BCR. Our development follows that 
of Jiang (2007), with some important differences. He studied the convergence rate of the 
predictive distribution in high-dimensional regression models under near sparsity conditions 
using Bayesian variable selection. His results on near parametric convergence rates are for 
the true posterior distribution, which is not computable. Instead we focus on obtaining 
corresponding results for BCR without model averaging, and hence study the large p, small 
n asymptotic performance of a posterior that is computable exactly even for massive p. 

Let fo denote the true conditional density of the response y given predictors x and let 
/ be the random predictive density that we obtain a posterior for. Define the Hellinger 

distance between / and / as d(f, fo) = \J J j{\ff — \fJo) 2 v y {dy)v x {dx), where v x (dx) is the 
unknown probability measure for x and v y (dy) is the dominating measure for / and fo- Our 
convergence rate results focus on 

Efo* [ d (fo, f) > e« | (yu Xi)?=i] < A n , (10) 

for large enough n, and some sequences e„, X n converging to as n — > oo. This shows 
that the posterior probability assigned outside of a shrinking neighborhood around the true 
predictive density f decreases to zero. Ideally, this result would hold even when the number 
of predictors increases more rapidly than the sample size and when e n decreases rapidly. In 
particular, we seek to establish a convergence rate e n of the order of n~ l l 2 up to a log(n) 
factor for the proposed model. Below we describe basic notations to be used throughout this 
section. 

3.1 Notation and Framework 

Letting p n denote the number of predictors for sample size n, we assume that p n is a non- 
decreasing sequence of n. We also let the subspace dimension m n grow with sample size, 
allowing that there may be additional signal discovered as more predictors are considered. 
We assume all the predictors are standardized, \xj\ < 1 for all j. The true density and the 
predictive density of the fitted model are assumed to lie in the class of generalized linear 
models with only parameter f3. We write 

/o(«o) = exp {ya(uo) + b(u ) + c(y)} ; u = x'(3 (11) 
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as the true density and 

y ~ f(u) = exp {ya(u) + b(u) + c(y)} ; u = (<f>x)'f3 (12) 

as the density of the fitted model, where a(z) and b(z) are continuously differentiable func- 
tions with a(z) having nonzero derivatives and $ is an m n xp n matrix. This parametrization 
includes some popular classes of densities including binary probit and logistic regression, lin- 
ear regression with constant variance and so on. The $ is standardized, so that ||3?£c|| < 
\/x. Although <fr grows in size with n, we suppress the dependence on n for notational clarity. 

Corresponding to the true predictive /o, there is a vector of true regression parameters 
/3 . We assume a near sparsity condition on /3 : lim^oo Yl^=i \Pjo\ < °°> implying that 
many elements are small in magnitude. This is a more appealing and weaker condition than 
the more standard exact sparsity assumption. 

Suppose with n samples we observe covariates x±, . . . , x n . We will use the empirical 
measure that puts - probability on each Xi,...,x n as the dominating measure v x on x. The 
dominating measure of (x,y) is taken to be the product of v x {x)v y {y) so that the true joint 
density of (x,y) is f (x' ' (3 G )v x (x)v y (y). 

In studying theoretical properties, we focus on the broader class of GLMs instead of just 
normal linear regression and additionally assume there is no free scale parameter, a standard 
assumption in the literature on high-dimensional regression. Following standard convention, 
we let a 2 = 1 without loss of generality. In addition, we consider two alternative priors 
for (3, with the first letting (3 ~ JV(0, S/3) and the second corresponding to independent 
j3j ~ DE(1). In both cases, under some assumptions, it can be shown that the posterior 
predictive densities achieve near parametric convergence rates. 



3.2 Main Results 

This section describes our main results. 

Theorem 3.1 Let f3 ~ iV(0, E^) apriori and B n and B_ n be the largest and the smallest 
eigenvalues of Hp. Further assume all the covariates are standardized, i.e. \xj\ < 1 and 

lim^oo J2 n \Pjo\ < K. Define D(R) = 1 + R sup \a'(h)\ sup , 9 n = y/m n p n . For a 

\h\<R \h\<R 

sequence e n satisfying < < 1, ne^ — > 00, assume the following to hold 



. } mJog(l/g) _^ Q) log(m n ) _^ Q) m n \ogD(9 n ^8B n nel) ^ q ^ 



nel nei ne 2 



(11) B n < Bm v n , B n > B 1 {log{m n ))- 1 (14) 

M 1 °«^0, ||«,||'>g ^ H +1 )^) , V . = SI (15) 

ne 2 n Bi nei 



then 



E f0 7c [d(f, /o) > 4e n I (y h Xi ) n i=1 ] < Ae~ n ^ /2 for all largen. (16) 

The conditions in (i) are primarily designed to impose a restriction on the size of the model, 
so that the subspace dimension m n cannot grow too rapidly with n. The constraint on the 
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growth of D(6 n y8B n ne 2 ) is, however, difficult to interpret immediately. A close inspection 
tells us that the rate at which D(R) grows is solely dependent on the rate of growth of 
a'(z) and ^fj- For logit, probit and linear regressions with known error variance, ja'^)! an d 

| at most grow linearly with \z\ (Jiang, 2007). Therefore, some additional restrictions 
are imposed on the growth of the number of predictors and subspace dimension. 

The conditions in (ii) impose some constraints on the prior covariance matrix of (3. It is 
evident that these conditions are quite relaxed and can even be satisfied with naive choices 
such as = I. 

The conditions in (iii) characterize the class of feasible matrices restricting upper and 
lower bounds of ||3>£c|| for all observed covariates. Intuitively, compression with $ should 
not take away the power of the covariates to explain response. It is not clear how to choose a 
matrix $ deterministically that satisfies condition (iii). This suggests generating a random 
matrix $ that satisfies condition (iii) with high probability, which is the same approach 
taken in the compressive sensing theory literature. To avoid needless complexities in the 
proof, we assume that condition (iii) holds with probability one. 

Below, we state the second result on the convergence rate with DE(1) priors on the 
components of (3. 



Theorem 3.2 Let f3j 's be assigned DE(1) apriori. Define D(R) = 1+R sup |a'(/i)| sup 

\h\<R \h\<R 

9n — \/ m nVn ■ Further assume that for a sequence < e n < 1 satisfying ne^ — > oo, one has 

m n log(l/e 2 n ) ^ Q log{m n ) ^ Q m n \ogD{±0 n ne 2 n ) ^ Q 
ne 2 n ' ne 2 n ne 2 n 

(ii) * KU , " ; -»0, ®x 2 > 8 ^— \/x = Xl ,...,x n 

ne 2 n ne 2 n 



b'(h) 



a' (ft) 



{III ) 

n—>oo 



lim J2\0 jo \ <K, 



then 



E fo n [d(f, f ) > 4e n | ( Vi , x^ =1 ] < ^~ n ^ 2 for all largen. (17) 



These conditions are similar to those of Theorem 13.11 and hence we omit further discussion. 



From Theorem 3^ and the discussions that follow, it is evident that the convergence 
rate will be highly dependent on the rate at which p n and m n grow with n. Intuitively, a good 
convergence rate should require some control on the number of non-informative predictors. 
This in turn implies that p n should be bounded by some function of n. As far as m n is 
concerned, the theory shows that m n cannot grow above a certain limit. The lower bound 
on the size of m n is controlled by the complex dependence of m n on $ and the predictors 
through condition (iii). All of these considerations are put together in Corollary 3.3 to 
obtain a near parametric convergence rate for the proposed BCR approach. The proof of 
this corollary relies on routine algebraic manipulations and is thus omitted. 

Corollary 3.3 Consider linear regression, logistic regression or probit regression examples. 
Assume that (3 is assigned a N(0,Hp) prior with the largest and smallest eigenvalues of 
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Yip, B n ,B n respectively, satisfying B n < Bm v n , B_ n > Bi log(m n ) _1 , for all large enough 
n, for some positive constants B, B% and v. Suppose \xj\ < 1 for all j. Assume further 
that p n < exp(Cra^) for some C > and some ( £ (0)1); f or a ^ large enough n and 

Pn 

lim \Pjo\ < K < oo. Assume the conditions on the matrix as outlined in Theorem 

n-s>oo j =1 



3.1 



are satisfied for large enough n and the number of rows m n of $ satisfies lo — > ; 



" as 



for some k\ > for all large n. Then we can take the convergence rate in Theorem 3.1 

2 



e n ~n-( 1 -«/ 2 log(n)( fcl+1 )/ 2 



A similar result follows from Theorem 13.21 

Corollary 3.4 Consider linear regression, logistic regression or probit regression examples. 
Assume that (3j 's are assigned independently DE(1) prior. Suppose \xj\ < 1 for all j . Assume 
further that p n < exp(CV') for some C > and some ( £ (0, 1), for all large enough n and 

Pn 

lim \Pjo\ < K < oo. Assume the conditions on the matrix <fr ; as outlined in Theorem 



3.2 



are satisfied for large enough n and the number of rows m n of $ satisfies ^ lo ^)yn ~ ®> 



for some k% > for all large n. Then we can take the convergence rate in Theorem 3JZ as 

t n ^n-^/ 2 \og{nt^' 2 . 



4 Simulation Study 

In this section we compare the out-of-sample predictive performance of model averaged BCR 
to that of Ridge Regression (RR), Lasso (Tibshirani, 1996), partial least squares regression 
(PLSR), Bayesian Lasso (BL; Park et al, 2008), generalized double Pareto (GDP; Armagan 
et al, 2012), and Bridge regression (BR). We also consider an alternative implementation of 
our compression idea in which instead of using conjugate NIG priors with model averaging 
over the projection matrix, we generate a single projection matrix, with shrinkage priors 
specified for the coefficients on the compressed predictors. Following this strategy, we applied 
compressed versions of Bayesian Lasso (CBL) and generalized double Pareto (CGDP). These 
methods are slower than BCR in relying on MCMC but are massively faster than applying 
MCMC with shrinkage priors to the original data when p is enormous. As a default in these 
analyses, we use m = 40, which seems to be a reasonable choice of upper bound for the 
dimension of the linear subspace to compress to. 

To implement BCR, we set to be the identity matrix, which satisfies the restrictions 



in Corollary 3.3 The model averaging step in BCR requires choice of a window over the 
possible dimensions m. Motivated by the theory in Section [3j we choose the window as 
[|~2 * log(p)],min(n,p)} which implies that the number of possible models to be averaged 
across is s = min(n,p) — [2 * log{jp)\ . For MCMC based model implementations, we discard 
the first 2,500 samples as a burn-in and draw inference based on 7,500 samples. 
Moderate dimension cases 

We first consider moderately large p and n cases, assessing how sparsity and changing num- 
ber of samples impact performance. We generate observations from the standard linear 
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regression model with p = 100 predictors generated from the normal distribution having 
cor(xj, x'j) = 0.5^~ J ! We consider the following scenarios. 



First 5 regression coefficients are 1.2, the rest are zero and a 2 = 1, n = 70. 
First 5 regression coefficients are 1.2, the rest are zero and a 2 = 1, n = 110. 
First 15 regression coefficients are 1, the rest are zero and a 2 = 1, n = 70. 
First 15 regression coefficients are 1, the rest are zero and o 2 = 1, n = 110. 
All the regression coefficients are 0.2 and a 2 = 1, n = 70. 



All the regression coefficients are 0.2 and cr z 



1, 71= 110. 



The last two scenarios are referred to as the dense case. Although we have been focusing 
on settings in which sparsity is justified in our theory and motivation, it is also instructive 
to compare performance of BCR to its competitors in the more general case in which the 
p-dimensional predictors can be compressed to a much lower dimensional linear subspace 
without loosing much information about the response yi. These last two cases correspond 
to a one dimensional subspace with no sparsity. Dense cases are motivated by the practical 
applications where each of the covariates has small effect on the outcome. 

In our experiments y and X are centered and the columns of X are standardized to have 
unit variance. To implement LASSO, RR, BR and PLSR we used lars (Hastie et al., 2012), 
MASS (Ripley et al., 2012), monomvn (Gramacy, 2010) and pis packages in R respectively. As 
a default choice suggested in Armagan et al (2012), we fix hyperparameters a = rj = 1 for 
GDP and CGDP. In BL and CBL we put a Gamma(l, 1) prior on the Lasso penalty. 

In each of the six scenarios, we simulate 100 datasets. Table [T] presents the MSPE 
averaged over these simulated datasets where in each dataset MSPE is calculated over the 
same number of held-out observations as the number of training cases. The values in the 
subscripts represents bootstrap standard errors for the averaged MSPEs. This is calculated 
by generating 500 bootstrap samples from the 100 MSPE values, finding averaged MSPE in 
each of these 100 datasets, and then computing its standard error. 

From Table[l]it is evident that in the dense cases (Model 5, Model 6), performance of BCR 
is significantly better than the competing shrinkage and sparsity inducing approaches. Ad- 
ditionally, BCR yields remarkably better MSPE than partial least square regression (PLSR) 
under Model 5, which particularly favors the usage of PLSR. In sparse cases (Model 1, Model 
2) all the sparsity favoring approaches such as LASSO, BL, GDP work better than BCR. 

As sample size increases, performance of BCR improves along with its competitors. For 
cases with higher sample size (Model 2, Model 4), BCR shows competitive performance with 
GDP and BR. We repeated all the experiment after increasing signal to noise ratio and found 
similar ordering in their performances. 

Figure [T] shows empirical coverage probabilities of 95% predictive intervals for the com- 
peting Bayesian models in six scenarios. BCR and GDP have under-coverage in the sparse 
cases with low sample size (Model 1, Model 3) while the other compressed models have satis- 
factory coverage. BCR, however, shows excellent coverage in the dense case with low sample 
size. For the frequentist point estimation approaches, we use a two stage approach: (i) es- 
timate regression coefficients in the first stage; (ii) construct 95% Pis based on the normal 
distribution centered on the predictive mean from the regression model with variance equal 
to the estimated variance in the residuals. The median coverage of LASSO and PLSR are 
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Table 1: Out of sample MSPE x .1 for the competing approaches with bootstrap se x .1 in 
the subscript, with columns 1-6 corresponding to results under Models 1-6, respectively. 

Sparsity level 5 Sparsity level 15 Dense model 
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0.46o.oi 


0-12o.002 


O-Ho.004 


CGDP 


l-23o.034 


0.91o.042 


2.37o.i2 


1.76o.o8 


0.24o.oi2 


0.21o.021 


CBL 




U.UO0.04U 


2 09n 1 n 
^..u^o. 1U 


1 68n ri7 
■ . U / 


II /. ?.!\ Mil 


20n ma 


GDP 


0.39o.on 


0.36o.oi3 


0.46o.029 


■ 36o.oi 


0.52o.oio 


0.47o.024 


BL 


0.30o.oo7 


0.21o.oo6 


0.400.018 


0.23o.007 


0.39o.oo7 


0.29 .oii 


BR 


0.69o.oio 


0.25o.008 


1.36o.o60 


O.280.009 


0.45o.008 


O.2I0.007 


LASSO 


0.13o.003 


0.25o.029 


0.19o.on 


0.70o.o68 


0.49 .oo8 


0.64 .o40 


RR 


0.42q,009 


0.25o.oii 


O.5I0.020 


0.25o.008 


0.37o.oo7 


0.23o.025 


PLSR 


0.34q.007 


0.22q.008 


O.5I0.023 


0.29 .oio 


0.22q.004 


0-17o.005 



Table 2: Median lengths of 95% predictive intervals for the competing approaches. 





Sparsity level 5 


Sparsity level 15 


Dense model 




n = 70 


n = 100 


n = 70 


n = 100 


n = 70 


n = 100 


BCR 


6.06(4.99,7.22) 


4.49(3.54,6.46) 


7.89(6.45,9.61) 


5.30(4.65,5.96) 


4.06(3.43,4.83) 


4.02(3.56,4.68) 


CGDP 


10.46(8.52, 12.61) 


10.57(7.27,12.01) 


13.90(10.91,18.29) 


14.43(10.82,17.20) 


4.70(3.58,7.85) 


4.35(3.71,7.57) 


CBL 


10.98(9.18, 12.92) 


10.90(7.63,12.32) 


14.63(11.83,18.90) 


14.98(11.38,17.83) 


5.01(3.82,8.20) 


4.52(3.86,7.89) 


GDP 


3.92(2.87,5.51) 


4.00(2.87,4.95) 


5.33(3.11,6.97) 


4.66(3.49,5.50) 


6.16(4.81,7.59) 


4.56(3.39,5.27) 


BL 


4.89(3.77,6.12) 


4.56(3.68,5.20) 


7.04(5.75,7.95) 


5.58(4.44,6.36) 


6.06(5.09,6.91) 


5.31(4.38,5.88) 


BR 


17.16(15.49, 18.72) 


6.25(4.79,7.10) 


21.16(19.96,23.17) 


7.27(5.88,8.25) 


14.57(12.37, 16.38) 


5.68(4.75,6.55) 


LASSO 


2.70(1.32,3.46) 


5.85(3.59,9.72) 


2.17(0.86,2.72) 


8.86(4.01, 13.51) 


0.77(0.01,2.84) 


6.28(1.79,9.82) 


RR 


0.03(0.02,0.05) 


2.18(0.80,2.82) 


0.04(0.02,0.06) 


2.25(1.35,3.01) 


0.05(0.02,0.06) 


2.29(0.77,2.62) 


PLSR 


2.76(2.31,3.29) 


3.05(2.71,3.51) 


3.26(2.80,3.86) 


3.64(3.09,4.05) 


2.37(1.87,2.83) 


2.82(2.51,3.30) 



found to be 74% and 54% respectively in Model 1 while RR shows severe under- coverage. 
Coverage of LASSO becomes much worse (55% for Model 3 and 13% for Model 5) as sparsity 
decreases. The coverage of PLSR increases marginally as the sparsity decreases. As sample 
size increases, predictive coverage of BCR improves. In the dense case with higher sample 
size, BCR produces coverage closest to 95%. The coverage of LASSO, PLSR and RR also 
increases with increasing sample size. 

The median length of 95% Pis for each of the methods are provided in Table [2} Among 
the competing Bayesian methods, BCR and GDP have the shortest 95% predictive intervals 
in sparse cases while all the compressed models have wider predictive intervals. This explains 
under- coverage of BCR and GDP compared to the other competitors. In the dense case, 
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compressed approaches have narrow predictive intervals and better coverage. The frequentist 
intervals are narrower in general, likely as a result of ignoring uncertainty in parameter esti- 
mation. With increasing sample size, Pi's for BCR shrinks in size while maintaining better 
coverage. It is promising that BCR has such competitive performance even without consid- 
ering computation time; in the next simulation cases, we consider much higher dimensional 
cases and computation time comparisons. 




BCR BL BR CBL CGDP GDP BCR BL BR CBL CGDP GDP 

(a) Model 1 (b) Model 2 




BCR BL BR CBL CGDP GDP BCR BL BR CBL CGDP GDP 

(c) Model 3 (d) Model 4 




BCR BL BR CBL CGDP GDP BCR BL BR CBL CGDP GDP 

(e) Model 5 (f) Model 6 



Figure 1: Empirical coverage probability of 95% predictive intervals for all the competing 
models. 

High dimension cases 

To assess performance in much higher dimensional cases, we conduct a second set of simu- 
lation studies with n = 110 and p = {15, 000, 20, 000, 25, 000}. The MCMC-based BL, GDP 
and BR implementations become increasingly prohibitive as p increases unless compression 
is employed. Hence, we use compressed versions of Bayesian Lasso (CBL), Generalized dou- 
ble Pareto (CGDP) and Bridge regression (CBR) as competitors. The following two data 
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generating models are considered. 

Model 1: First 5 regression coefficients are 1, the rest are zero and a 2 = 1. 
Model 2: All the regression coefficients are .1 and a 2 = 1. 

In each scenario, we simulate 100 data sets. Table [3] presents the MSPE averaged over 
these simulated datasets where in each dataset MSPE is calculated over 110 held-out obser- 
vations. The values in the subscript represents bootstrap standard errors for the averaged 
MSPEs. Table [3] shows the mean squared out-of-sample prediction errors along with their 
bootstrap standard errors for each of these models with two different sparsity levels. 



Table 3: MSPEx.l for the competing models along with their bootstrap sd x.l 



(n,p) 




Sparsity level 5 






dense model 




(110,15000) 


(110,20000) 


(110,25000) 


(110,15000) 


(110,20000) 


(110,25000) 


BCR 


0.82o.020 


0.78o.on 


O.8O0.024 


0.24 .oi 


0.28o.oi2 


0.32o.oi 


CGDP 


0.84o.oi3 


O.860.013 


0.85o.oi5 


3.23o.45 


3-73o.59 


5.99 .92 


CBL 


0.77 .oii 


0.79 .oii 


0.77o.oi2 


3.12o.42 


3.59 .55 


5.70 .63 


CBR 


O.6O0.007 


O.6I0.008 


O.6O0.007 


2.930.39 


3.32o.43 


5.21 .76 


LASSO 


0.22o.003 


O.2I0.003 


0.190.002 


16.34o. is 


21.98o.24 


25.34o.38 


RR 


0.58o.oo6 


0.59o.oo8 


0.59o.oo7 


15.12o.is 


19.89o.24 


24.95o.33 


PLSR 


0.58o.oo6 


0.59o.oo8 


0.59o.oo7 


15.12o.i9 


19.89o.23 


24.95o.32 



If the true model is sparse, LASSO, RR and CBR are the three best performing methods 
with BCR showing competitive performance. As sparsity decreases, all the sparsity favoring 
models perform poorly. In the dense case, all the compressed models perform significantly 
better than the corresponding sparsity favoring models. BCR, in particular, shows excellent 
performance in this scenario. Figure [2] shows boxplots for the empirical coverage proba- 
bilities of 95% Pis for all the Bayesian models. BCR has coverage probabilities between 
90-98% in all cases. Other compressed regression models are also found to deliver excel- 
lent coverage probabilities. We also compute the coverage probabilities of LASSO, RR and 
PLSR using the two stage plug-in approach. When the sparsity is low, the median cover- 
age probabilities (with 95% CIs) of LASSO are .75(.27, .85), .63(.31, .84) and .47(.02, .79) 
for p = 15,000,20,000,25,000 respectively. For the dense cases, the coverage probabili- 
ties of LASSO are .93(.22, .97), .90(.19,.95) and .93(.13, .96) in p = 15,000,20,000,25,000 
respectively. RR and PLSR are found to suffer from severe under- coverage. 
Computational Speed 

A crucial consideration is comparing alternative methods is computing time. The approach 
of applying MCMC to the compressed data, which is employed in CBL, CGDP and CBR, 
is reasonably fast to implement. Using non-optimized R code implemented on a single 3.06- 
GHz Intel Xeon processor with 4.0 Gbytes of random-access memory running Debian LINUX, 
the computing time for 10,000 iterations of CGDP and CBL in the n = 110 and p = 25, 000 
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Figure 2: Empirical coverage probability of 95% predictive intervals for all the competing 
models. 



cases was only 23.9 seconds. 

One of the major advantages of BCR is the rapid (often essentially instantaneous) com- 
puting time. Important contributors to computation time include data compression, which 
involves Gram-Schmidt orthogonalization of m rows of an m x p matrix, multiplying an 
n x p and p x m matrix, as well as the time to invert low dimensional m x m matrices in 
the process of calculating the posterior and posterior weights. If the sample size n is not 
large, these inversions are very quick. Only the matrix multiplies and the Gram-Schmidt 
orthogonalization involved in the compression convey increasing burden with increasing p. 
Given that the whole model averaging process is embarrassingly parallelizable over different 
choices of m, the computation can be done very quickly using a parallel implementation 
with sufficiently many processors. Even not exploiting any parallelization, one obtains re- 
sults quickly using non-optimized R code on a single 3.06-GHz Intel Xeon processor with 4.0 
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Gbytes of random-access memory running Debian LINUX. 

Figure [3] shows the computational speed comparison between BCR and LASSO for n = 
110. Figure [3] indicates that when p is small, BCR enjoys little computational advantage 
over LASSO. The advantage is substantial as p increases; this becomes particularly notable 
as we scale from tens of thousands of predictors to millions or more, which is becoming 
increasingly common. 
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Figure 3: Computational time in seconds for LASSO and BCR against log of the number of 
predictors. 

These computational comparisons are using a parallel implementation of BCR. This can 
be further improved parallelizing the matrix multiplies involved in the initial data compres- 
sion and Gram-Schmidt orthogonalization. 



5 Molecular Epidemiological Study 

We apply Bayesian compressed regression to data from a molecular epidemiology study. The 
focus is on assessing gene-environment interactions between chemical exposures and variants 
in genes thought to be important in DNA damage and repair pathways. Immortalized cell 
lines were established for 90 individuals chosen to represent the ethnic mix in the United 
States. For these individuals, single nucleotide polymorphisms (SNPs) in DNA damage and 
repair genes were measured; in particular, there were 42,181 SNPs discarding loci for which 
there is no variability among individuals in the sample. Replicated samples over 100 cells 
from each cell line were allocated to 1 of 3 groups: (i) analyzed without treatment, (ii) 
analyzed immediately after exposure to a known genotoxic agent (MMS, H2O2), or (hi) 
analyzed after allowing some time (10, 15, 240 minutes) for DNA repair. We will code these 
groups as "NT", "0" and "Later" respectively. 
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The frequency of DNA strand breaks was measured for each cell using single cell gel 
electrophoresis, which is also known as the comet assay. When subject to electrophoresis, 
the nucleoid of cells with many strand breaks pull apart, changing in shape from a ball to 
resemble a comet. Comet assay image processing software produces multiple measures of 
the amount of DNA in the comet tail, which provides a useful surrogate. The Olive tail 
moment (Olive et al., 1990) has been established as the best of the surrogates in previous 
studies (Dunson et al., 2003), and is defined as the percentage of DNA in the tail of the 
comet multiplied by the length between the center of the comet's head and tail. 

For each cell line for each group and for each genotoxic agent (MMS, H2O2), we compute 
33 quantiles of the Olive tail moments. For cell line i, group j (j = NT, 0, Later) and agent 
k {k = MMS, H2O2), Vi,j,k = {Vi,j,k,Xi ■••) Vi,j,k,^)' is the vector of 33 quantiles of Olive tail 
moments. We derive two sets of response variables from the y^fc's. Let 

1- O i>k = y iAk - y i>NT ,k = (O i>k ,i, O itkt33 )' = change in DNA damage from "NT" to 0. 

2. Ji,k = Vi,o,k - Vi,Later,k = (Ji,k,i, —, Ji,k,m)' = change in DNA damage from to "Later". 

While Oj^'s measure an individual's sensitivity to DNA damage induced by toxic reagents, 
Ji^'s quantify an individual's repair rate after damage. Predictors include the SNPs (xi) and 
the type of genotoxic agent, = 1 for H2O2 and for MMS. We consider the following 
linear regression models: 

O iA i = aE iA i + x'ffi + E iA ix'ff 2 + e itk ,i, e iA i ~ N(0, a 2 ). (18) 

A similar set of 33 independent models have been fitted using the responses Jj^^s. Each 
of these models includes main effects for the SNPs and interactions between the SNPs and 
exposure type, leading to p = 84, 363 predictors. 

Analyses are conducted as in the simulation section, with 10-fold cross validation used. 
We implemented all competing methods for each of the 33 quantiles for the two sets of 
responses, but the standard package used to implement LASSO (lars) failed to converge 
even when we modified tuning parameters, seemingly due to the dimensionality and discrete 
nature of the predictors. Hence, we do not report results for LASSO. To get LASSO to 
run, we used a randomly selected subsample of 30,000 SNPs, and in those analyses we found 
substantially improved out of sample predictive performance for our compressed regression 
methods compared with LASSO and the other competitors uniformly across quantiles. We 
do not present those results here but focus on analysis of the complete data set. 

Below we present squared correlation between the observed and the fitted responses 
for some of the initial quantiles. This is an indicator of the amount of variability in the 
response explained by the predictors. It is evident both from Table [4] and Table [5] that 
all the compressed models show high correlation between observed and predicted responses, 
although compressed Bridge regression (CBR) performs better than the other competitors 
uniformly. PLS performs similar to the BCR and RR shows worse performance than BCR. 
The figures in the brackets present the length of the 95% Pi's for the competing methods. We 
fit responses corresponding to the higher quantiles and found decreasing squared correlation 
between observed and fitted responses. 

Figure g] represents the coverage of 95% Pi's for BCR, CGDP, CBL and CBR, PLSR 
and RR for the two types of responses. It is evident from the plot that while BCR shows 
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Figure 4: Coverage of 95% Pi's for the response variables 



excellent coverage, other compressed approaches have a slight under coverage. This can be 
attributed to the fact that the lengths of the 95% Pi's for CGDP, CBL, CBR are much 
narrower than the lengths for BCR as presented in Table [4] and Table |5j The frequentist 
methods also suffer from some under-coverage for the second set of responses, although they 
show competitive performance with BCR for the first set of responses. 



6 Discussion 



The overarching goal of the proposed approach is to define a practical approximation to 
Bayesian inference in massive dimensional regression settings through the use of random 
compression of the predictors prior to analysis. Given the dramatic computational gains, we 
expect to pay some price in terms of predictive accuracy and were pleasantly surprised that 
this price seemed to be small in most cases. In fact, the approach had better performance 
in terms of mean square prediction error overall than many standard competitors due in 
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Table 4: Squared correlation between the predicted and observed responses for O^.The 
fig ures in the bracket represent length of the 95% Pi's for compressed models. 



responses 


resp3 


resp4 


rcsp5 


rcsp6 


rcsp7 


rcsp8 


resp9 


BCR 


0.55 (15.45) 


0.54 (16.11) 


0.54 (16.52) 


0.50 (16.89) 


0.48 (17.49) 


0.45 (18.02) 


0.43 (18.34) 


CGDP 


0.65 (7.51) 


0.63 (7.84) 


0.61 (8.14) 


0.59 (8.34) 


0.57 (8.63) 


0.55 (8.90) 


0.52 (9.05) 


CBL 


0.70 (7.39) 


0.68 (7.71) 


0.67 (7.92) 


0.65 (8.12) 


0.63 (8.40) 


0.61 (8.64) 


0.59 (8.79) 


CBR 


0.73 (7.28) 


0.71 (7.62) 


0.70 (7.81) 


0.69 (8.01) 


0.67 (8.29) 


0.65 (8.53) 


0.63 (8.69) 


LASSO 


(") 


(") 


(") 


(") 


(-) 


(-) 


(-) 


RR 


0.48 (11.26) 


0.44 (11.87) 


0.42 (12.19) 


0.40 (12.46) 


0.37 (12.99) 


0.33 (13.47) 


0.31 (13.71) 


PLSR 


0.54 (11.03) 


0.50 (11.60) 


0.49 (11.73) 


0.48 (11.83) 


0.44 (12.46) 


0.40 (12.92) 


0.38 (13.05) 



Table 5: Squared correlation between the predicted and observed responses for J^k- The 
figures in the bracket represent length of the 95% Pi's for compressed models. 



responses 


resp3 


rcsp4 


rcsp5 


rcsp6 


rcsp7 


rcsp8 


resp9 


BCR 


0.46 (11.53) 


0.43 (11.83) 


0.39 (11.95) 


0.34 (12.11) 


0.31 (12.25) 


0.27 (12.68) 


0.22 (12.99) 


CGDP 


0.55 (5.76) 


0.50 (5.95) 


0.46 (6.09) 


0.45 (6.12) 


0.42 (6.17) 


0.38 (6.32) 


0.35 (6.43) 


CBL 


0.62 (5.59) 


0.58 (5.75) 


0.56 (5.84) 


0.53 (5.89) 


0.51 (5.95) 


0.46 (6.11) 


0.42 (6.22) 


CBR 


0.66 (5.48) 


0.62 (5.66) 


0.60 (5.73) 


0.57 (5.80) 


0.55 (5.88) 


0.51 (6.04) 


0.47 (6.16) 


LASSO 


(") 


(-) 


(-) 


(-) 


(-) 


(") 


(-) 


RR 


0.38 (8.07) 


0.33 (8.41) 


0.29 (8.57) 


0.26 (8.63) 


0.21 (8.88) 


0.16 (9.24) 


0.13 (9.45) 


PLSR 


0.51 (7.03) 


0.46 (7.30) 


0.42 (7.45) 


0.38 (7.55) 


0.34 (7.73) 


0.28 (8.03) 


0.24 (8.14) 



part to the ability to characterize sparse as well as dense cases in which predictors can be 
compressed to a lower- dimensional linear subspace with minimal loss of information about 
the response. 

There are many natural directions in terms of future research. Firstly, the random projec- 
tion approach can be directly extended to cases involving matrix or tensor-valued predictors, 
which increasingly arise in modern applications. For example, if the predictor for subject 
i is matrix-valued, then one can pre- and post-multiply by random projection matrices to 
compress to a lower-dimensional matrix that will be easier to handle computationally. Sim- 
ilarly, in large n and p settings one can compress the enormous n x p design matrix by pre- 
and post-multiplication to obtain a smaller data set, while also appropriately compressing 
the response. This would combine compressed sensing and compressed regression. 

Although our focus has been on Bayesian approaches in GLMs, compressed regression can 
similarly be used much more broadly. For example, one can estimate frequentist regression 
models for many different randomly compressed predictor vectors, with the results averaged 
in performing predictions. This approach is reminiscent of random forests, but is quite 
different in nature and could be referred to as compressed averaging (Caving). Finally, 
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although the dramatic computational gains are attributable to the use of random projections 
generated in advance of the analysis, there is the possibility of refining these projections based 
on the available data; ideally this could be done in a manner that only adds modestly to the 
computational burden. 
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7 Appendix 



Define, 



do(fJo) 



d(f,fo? 



dt(fjo) 




Assume V n is a sequence of sets of probability densities. Let N(e n , V n ) is the minimum 
number of Hellinger balls of radius e n needed to cover V n . 
Define the following conditions: 

1. log N(e n ,V n ) < ne 2 n for all large n 

2. 7r(7>£) < e" 2n ^ for all large n 



Our proof will be complete if we can show conditions 1, 2, 3 hold in our case with some t. 
We prove them for t — 1. 

First we will state and prove two propositions which will be used subsequently to prove 
the main results. 

Proposition 7.2 Assume (3 is assigned iV(0, S^) apriori. Then 



where X ~ Pois(f), Y ~ Pois(f ) uritfi Ax = (#as) ,^ (#iB) , A = ( ^y£f^ x) 

Proof Note that ($x)'(3 ~ iV(0, (SaO'E^Sa:)). This implies that ^Sfrr^f ~ Xi(A). 



Invoking the popular representation of noncentral x 2 density with mixture of gamma densi- 
ties, one gets 



3. n[f : d t (f, f ) < e f]> e~ n ^ for all large n 



It has been proved in Jiang (2005) that 

Proposition 7.1 If ne 2 n — > oo then under 1, 2, 3 (for some t > 0), we have 



E f[) n [d(f, f ) > 4e n | (y h Xi )U < 4e~<* 



P(\(&x)'(3 - x'(3 \ < A) > P(X - Y > 2), 



P(\(3>x)'f3-x'f3 \ <A) = P 



\(<f>x)'(3 -x'(3, 



< A 





(19) 



i=0 
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where Z 1+2i ~ x l+x- Note that p ( z i+2i < Ai) > P(Z 2+2i < A x ) = P(G < AO, where 
G ~ Gamma(^, 1 + Assume, Hn, H^i +i ) ~ ea;p(|) for i G N. 

Consider a Poisson process with interarrival times Hn, Hi(i + 1) respectively. If 

iV(Ai) be the number of events upto time Aj then iV(Ai) = X (using results from the 
theory of Poisson process) and 

{H a + --- + H i{i+1) < At} = {iV(Ai) > i + 1} 

Using the above results, ( [19] ) can be written as 

£ ^-^P(Zi +ai < AO > £ ^-^PCSh* < AO 

i=0 ' i=Q 

= E -^-p < M =E ^H- w.) > 1 + 

i=0 \fc=l / *=0 

00 _A/A\? co _A / A\j 



£ —^p(x >i + i) = E hr^ p(A ' s 2 + *> 

i=0 ' i=0 

*Y~Po*.(4) > 2 + •)] = P(* > 2 + y) = P(X - Y > 2) 



We seek to obtain proposition of similar spirit when (3j ~ DE(1) i.i.d prior. Using the fact 
that the DE(1) distribution is the scale mixture of normal distribution, we have 

^\rf^ d N(0,Tf), j = l,...,m n 
2 



Stacking them up, we obtain f3 = (/?i, ...,/3 mn )' ~ N(0,D T ), where -D T = diag(rf, ...,r^ n ). 



The next proposition comes in the same spirit as Proposition 7.2 The proof of this propo 



sition follows along the same line as of Proposition 7.2 and is, therefore, omitted. 



Proposition 7.3 P(\(*x)'(3 - x'(3 \ < A | rj 2 , t£J > P(X 2 -Y 2 >2\ if, r^J, where 
X 2 ~ Po iS (f), Y 2 ~ Pds(^) with A 2 = { ^ y ^ x) , X 2 = (^yfefey 

Proof of the Theorem I3.lt 



Proof We will check the three conditions with t = 1. Let b n = y8B n ne^ 

condition 1: Assume V n be the set of all densities that can be represented by (3 s.t. 
< b n , V j. Lets consider l°° balls of the form (aj — 5,aj + 5) for each coordinate with 

the center of each ball inside V n . It is not difficult to see that one needs at most (%r + l) m ™ 

such balls to cover V n . 

Let f u be any density in V n . Therefore, 3 7 s.t. u = (<&£c)'7, < b n and 

fu(y) = exp {ya{u) + b{u) + c(y)} . 
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Let 7j G (atj — 5, acj + 5) V j, s.t. \jj — <X/| < 5 and |<x,| < 6 n . Let, v = (<frx)'a and 

fv(y) = exp {ya(w) + b(v) + c(y)} . 

We will use the well known fact that d(f u ,f v ) < {^d (f u , fv)} 1 ^ 2 and bound on do(f u ,f v ). 
Note that, 




do(fujv) = I I fvlog \y) v y {dy)v x {dx) 




[y(a(v) - a(u)) + (b(v) - b(u))) f v v y (dy) )■ v x {dx) 
b'(v) 



a(v) — a(u)) 
(v - u) \ (a'(u v ) 



a'(v) 
V{v) 
a'(v) 



+ (b(v) - b{u)) | u x (dx) 
+ b'(u v ) \ u x (dx) 



(20) 



where the last step follows from mean value theorem, u v is an intermediate point between u 
and v. Applying Cauchy-Schwartz inequality we have, 

\v -u\ = \(<$>x)'(j - a) I < ||*aj|| ||7 - a|| < v^||a||<5 < VPn m J = QJ 

With a similar argument one can show that \v\ < b n 6 n and \u\ < b n 6 n . Therefore, \u v \ < b n 6 n . 
Using these results and (20) 

1 . , . b'{h) 



do(fujv) < sup \a'(h)\ sup 



\h\<b n e n 



\h\<b n e n 



a'(h) 



9J, 



} 1/2 

Using the well known fact that d(f u , f v ) < {^d (f u , f v )\ we obtain 



d(f u , fv) < \ I sup \a'(h)\ sup 

\h\<b n 9 n \h\<b n 0„ 



b'(h) 



a'(h) 



ej. 



Choosing 5 = — 

sup |a'(/i)| sup 

\h\<b n e n \h\<b n e n 
V n is bounded above by 



b'(h) Yj. 1 



one gets d(f u , f v ) < e n . Therefore, the entropy of 



bnMr, 



sup sup 



z n \h\<b„9 n 



\h\<b„e n 



V{h) 



a'(h) 



1 1 DjprA 



< 



D(b n e n ) 



Therefore, logN(e n ,V n ) < mjog 1 
condition 1 follows. 



. Using the assumptions given in the proposition, 
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condition 2: Note that 

m n 

= 7r(U£J|&| > K\) < 52*m > b n ). 



3=1 



By Mills ratio this quantity is bounded above by 2m n exp { fe "/ 2 £gJ _ 2m n ( ^ ^^\ Using 

conditions in (i), for all large n, the expression can further be bounded above by exp {— 2ne 2 l }. 
This yields condition 2. 

condition 3: Using proposition 7.2 we obtain, P{\{<&x)' (3 — x'f3 \ < A) > P{X — Y > 2). 
Note that, when X ~ Pois(^) and Y ~ Pois(^-), X — Y follows Skellam distribution with 

P(X - Y = k) = exp{-(A 1 + Ax)} f^j ij fc , (2y/\^ . (21) 

Plugging in Ai and Ai in (21 ), we have 



(#aj)'S^(#aj) j \(x'/3 )V ' ' \ (SaO'E^a:) 
Now we use the fact that for z > 0, J„(z) > 2 I/ z ; T(z/ + 1) (Joshi et al, 1991) to get 

((x'(3 ) 2 + A 2 ) I 2 4 A 4 



P(X - Y > 2) > P(X - Y = 2) > exp 

> exp 

> exp 



((x'{3 ) 2 + A 2 ) I 2 4 A 4 



fiJI^H 2 J B%\\<f>x\\* 
((:r'/3 ) 2 + A 2 )log(^n)l 2 4 A 4 



5i||$:z;|| 2 J m 2 ^ 2 ||$a;|| 4 

Take A = ||, r) will be chosen later. Note that, 2wlogm n + 21og( J B)-41og(2)-41og(A) > 0, 
for all large n and 

q 2« log m n + 2 log(B) - 4 log(2) - 4 log(A) 

o 5 > U, 

ne 2 

from the assumptions. Therefore, ^rfsp^jp > exp{— ne 2 /8} for all 
Also, note that (x'{3 ) 2 < Y, n \Pjo\ < K. Therefore, 

/ ((^/3 ) 2 + A 2 )log(m n )\ \ [K 2 + A 2 ) log(m r 

exp < > > exp 



fiiH^^H 2 j 1 \ B^xW 2 

ne 2 n {K 2 + 1) log(m n ) 1 _ f ne 2 



>6XP ^X ^ >6XP 
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Therefore, P(X — Y>2)> exp{ — Proposition 7.2 then yields 

P (\(&x)'P - x'(3 \ > exp j-^ j for all large n 



For x = xi,...,x n , let S = |/3 : \(&x)' (3 — x' (3 \ < Now, take t = 1. Therefore 

we can write dt(f,fo) = E x \g(ui)((^x)'P — x'f3 )], where g has continuous derivative in 
the neighborhood of x'/3 and m is an intermediate point between (<f>x)'[3 and x'/3 , by 
integrating y and applying a first order Taylor expansion. Choose r] s.t. \g\ < rj in the 
neighborhood [—(if + 1), (if + 1)] for all large n. Then 



\m\ < \(&x)'P - x'P \ + \x'p \ <^ + K 



implies that d t (f, f Q ) < is a subset of S . Hence, condition 3 is satisfied. 
Proof of the Theorem 13. 2t 

Proof Here also we proceed by checking conditions 1, 2, 3 for t — 1. 

condition 1: This proof follows exactly along the same line as in the proof of the 
condition 1. in Theorem 13.11 

condition 2: Note that, 

m n 

<K) = <^7=iW > K\) < $>(|&| > K) = m n exp{-b n }. 

Under the assumptions, this expression can be bounded by exp {— 2ne 2 } for all large n. 
condition 3: Following the same line as in the last Proposition, we obtain 



P(\(<f>x)'f3 - x'P \ < A | r 2 , r 2 J > P(X 2 -Y 2 >2\ jf, r, 



> exp 



x'/3 ) 2 + A 2 ) 2 4 A 4 



' min II 1 1 ' 



Integrating w.r.t (r^ m , T^ ax ) 

/DC /*GO 
7 exp 



J V mm' max) max i 



t 2 lld>T>l|2 I I |<|>T»| |4t-4 v rmn> max/ max mm 
1 min\\^ f ' v \\ ) 11^*^11 'mm 



/ / i l 2 ||cb*T»ll2 [ 1 1 (fosr* I |4t4 \ mini 1 max / max 1 *" min 



/l; -2 ^2 ,.,.2 ,,_2 

' max 



j (xU) 2 + A 2 | 2 4 A 4 /"» 1 2 2 2 2 

J> exp < ii^r_ih I ii^f.„ii4 / / _4 J vmin' ^~max ) ^~rnax ^~mi 



\<§>x\\ 2 | ||$a3|| 4 ./i 7^2 T max 



m i n 
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Note that 



oo poo 




2 Tl 



il 2 2 W r 2 ^ T 2 

j V mini ' max) max min 



mm 
"OO ^OO 




m r 



2 2 2 r 4 

1 ' max 



2 7 2 2 r 4 

' max 



cxp 



cxp 



(t 2 +r 

V max 1 ) 



2 > 
mini 



cxp 



< 



cxp 



- exp 
T 



2 

exp 



exp f - exp 
|}-exp 



— exp 



drldrL. 



max mm 



r 2 . 

mm 



dr 2 dr 2 



mm" mm 



Let, C n — f 2 ^r— ■/(' r mm; T max)dT max d' Therefore 

P(|($x)'/3 - x'/3 | < A) > exp |_ (:E,/3o)2 + A ' 



$33 



Take A = Note that for all large n, x = Xi, x n . 

4 log 1 1 -41og(2) -41og( A) -log(C n ) 

° 2 

ne 2 

from the assumptions. Therefore, Tj^Tpr > exp{— ne 2 /8}. 
Also, note that (x'f3 ) 2 < ]T) n IA*o| < K. Therefore, 



2 4 A 4 

WxW n 



exp 



> exp 



((*73 ) 2 + A 2 ) 



nc 



$£C|| 2 

2 



> exp 



(K 2 + A 2 
II<E»:e|| 2 



Proposition 7.3 then yields 



P[\(*x)'(3-x'(3 \ < ^ ] x>xp 



net 



for all large n 



m n — 2 



dr 2 dr 2 • 

max mm 



Let 5 = j/3 : |(<&£c)'/3 — £c'/3 | < ^ j. Now, take t = 1, therefore we can write c?t(/, /o) = 

-^x[5 , (Mi)(($^) / /3 — x'flo)], where Ui is an intermediate point between (&x)'/3 and x'(3 , by 
integrating ?/ and applying a first order Taylor expansion. Now choose rj as in the proof of 
the previous theorem. Then 

e 2 

\ Ui \ < \(&x)'p - x'/3 \ + \x'/3 \ <^ + K 
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implies that d t (f, /o) < e -f is a subset of S. Hence, condition 3 is satisfied. 
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